CREATED BY-

NAJEEB FARIDUDDIN SAIYED PRN 21070126057

KOTA SRINIVAS PRN 21070126050

Abstract

Customer churn is a critical concern, especially in the telecommunication industry, because companies need to understand and analyze customer trends to tell whether customers are going to unsubscribe from the firm. This is where machine learning algorithms help companies predict customer behaviour.

Introduction

What is a Customer Churn?

Customer churn is when subscribers or customers stop using a firm's services.

Customers can choose between a variety of telecom providers, which makes the industry highly competitive: the annual churn rate of telecommunication businesses is between 15 and 25 percent.

Individualized customer retention is tough because most firms have a large number of customers and can't afford to devote much time to each of them. The costs would be too great, outweighing the additional revenue. However, if a corporation could forecast which customers are likely to leave ahead of time, it could focus customer retention efforts only on these "high risk" clients. The ultimate goal is to expand the company's coverage area and win more customer loyalty. The key to success in this market lies in the customers themselves.

A cool fun fact - "Did you know that attracting a new customer costs five times as much as keeping an existing one?"

To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.

To find any signs of churn, a company must develop a holistic view of its customers and their interactions with the services it provides, including store/branch visits, product purchase histories, customer service calls, Web-based transactions, and social media interactions.

By analysing their customers well, these businesses can not only stand against their competitors but also grow and thrive. Thus the company's key focus for success is retaining customers and implementing an effective retention strategy.

2. Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
In [2]:
#reading dataset1
df = pd.read_csv("C:\\Users\\Najeeb\\Desktop\\WA_Fn-UseC_-Telco-Customer-Churn (1).csv")

3. Understanding the Data

In [3]:
#displaying first 5 rows of the dataset
df.head() 
Out[3]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

In [4]:
#displaying all columns to get a better understanding of what is present in the dataset
df.columns
Out[4]:
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
In [5]:
#shape of the data
df.shape
print("Number of rows are: ", df.shape[0])
print("Number of columns are: ", df.shape[1])
Number of rows are:  7043
Number of columns are:  21
In [6]:
df.index
Out[6]:
RangeIndex(start=0, stop=7043, step=1)
In [7]:
#displaying the information of this dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

describing the data

In [8]:
#describing the data
df.describe(include=object)
#include=object restricts the summary to the columns with 'object' dtype
#(np.object is a deprecated alias for the builtin object, so the builtin is used here);
#without it, describe() would only report count, mean, etc. for the numeric columns.
#'top' is the most frequent value in each categorical column and 'freq' is its count.
Out[8]:
customerID gender Partner Dependents PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod TotalCharges Churn
count 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043
unique 7043 2 2 2 2 3 3 3 3 3 3 3 3 3 2 4 6531 2
top 7590-VHVEG Male No No Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check No
freq 1 3555 3641 4933 6361 3390 3096 3498 3088 3095 3473 2810 2785 3875 4171 2365 11 5174
In [9]:
df.describe()
Out[9]:
SeniorCitizen tenure MonthlyCharges
count 7043.000000 7043.000000 7043.000000
mean 0.162147 32.371149 64.761692
std 0.368612 24.559481 30.090047
min 0.000000 0.000000 18.250000
25% 0.000000 9.000000 35.500000
50% 0.000000 29.000000 70.350000
75% 0.000000 55.000000 89.850000
max 1.000000 72.000000 118.750000

-SeniorCitizen shows 0 at the 25%, 50% and 75% quantiles because it is really a categorical (0/1) variable, so these statistics are not meaningful for it

-75% of customers have a tenure of 55 months or less

-The average monthly charge is $64.76

Diving into the columns

In [10]:
df['Contract'].unique()
Out[10]:
array(['Month-to-month', 'One year', 'Two year'], dtype=object)
In [11]:
df['PaymentMethod'].unique()
Out[11]:
array(['Electronic check', 'Mailed check', 'Bank transfer (automatic)',
       'Credit card (automatic)'], dtype=object)
In [12]:
df['InternetService'].unique()
Out[12]:
array(['DSL', 'Fiber optic', 'No'], dtype=object)

Data Manipulation & Data Preprocessing

In [13]:
#Dropping the customerID column since it is not needed for the analysis

customerID = df['customerID']  #saving it in customerID in case we need to access it later
df = df.drop(['customerID'], axis = 1)
df.head()
Out[13]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
2 Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
In [14]:
#TotalCharges is still an object column; it needs to be converted to a numeric type
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            7043 non-null   object 
 1   SeniorCitizen     7043 non-null   int64  
 2   Partner           7043 non-null   object 
 3   Dependents        7043 non-null   object 
 4   tenure            7043 non-null   int64  
 5   PhoneService      7043 non-null   object 
 6   MultipleLines     7043 non-null   object 
 7   InternetService   7043 non-null   object 
 8   OnlineSecurity    7043 non-null   object 
 9   OnlineBackup      7043 non-null   object 
 10  DeviceProtection  7043 non-null   object 
 11  TechSupport       7043 non-null   object 
 12  StreamingTV       7043 non-null   object 
 13  StreamingMovies   7043 non-null   object 
 14  Contract          7043 non-null   object 
 15  PaperlessBilling  7043 non-null   object 
 16  PaymentMethod     7043 non-null   object 
 17  MonthlyCharges    7043 non-null   float64
 18  TotalCharges      7043 non-null   object 
 19  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(17)
memory usage: 1.1+ MB
In [15]:
#TotalCharges is stored as an object (string) column, so we convert it to a numeric (float) type

df['TotalCharges'] = pd.to_numeric(df.TotalCharges, errors='coerce')    #non-numeric values (such as blanks) are replaced with NaN
df['TotalCharges'] = df['TotalCharges'].astype("float")
df.isnull().sum()
Out[15]:
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

After converting the column to a numeric type (so that mathematical operations can be applied to it), we see that it actually contains 11 missing values; these were blank strings in the original data.
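The effect of errors='coerce' can be reproduced on a tiny, made-up series. This is a minimal sketch: the values below are hypothetical stand-ins for the TotalCharges column, with a single blank string playing the role of the blank entries.

```python
import pandas as pd

# Hypothetical miniature of the TotalCharges column; " " mimics a blank entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # numeric dtype after conversion
print(converted.isna().sum())  # how many entries could not be parsed
```

Counting the NaN values right after the conversion, as the notebook does with isnull().sum(), is what exposes blanks that a plain df.info() on the string column cannot show.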

In [16]:
#Rows with Null value in Total Charges - Column
df.loc[df['TotalCharges'].isnull() == True]
Out[16]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
488 Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 NaN No
753 Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 NaN No
936 Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 NaN No
1082 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 NaN No
1340 Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 NaN No
3331 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 NaN No
3826 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 NaN No
4380 Female 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 NaN No
5218 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 NaN No
6670 Female 0 Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 NaN No
6754 Male 0 No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 NaN No

We are going to drop the rows with null values and readjust our index using reset_index function

In [17]:
df.dropna(how = 'any', inplace=True)
df = df.reset_index(drop=True)
#how: 'any' or 'all'.
#'any' drops the row/column if ANY value is null; 'all' drops it only if ALL values are null.
#axis=0 (the default) drops rows; axis=1 or "columns" drops columns.
In [18]:
#cross checking if there are any null values in our dataset.
df.isna().sum()
Out[18]:
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
In [19]:
#since the SeniorCitizen column is stored as 0's and 1's, we are going to
#convert it to Yes or No, just like other columns such as Partner, Dependents, PhoneService, etc.,
#to keep the data values consistent across columns.

df["SeniorCitizen"]= df["SeniorCitizen"].map({0: "No", 1: "Yes"})
df.head()
Out[19]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female No Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male No No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 Male No No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male No No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female No No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

All rows with null values have now been removed, so there are no more null values in the dataset. We can therefore proceed with encoding the categorical columns and with exploratory data analysis.

Using Label Encoder

In [20]:
df2 = df.copy()
In [21]:
from sklearn.preprocessing import LabelEncoder

def object_to_int(dataframe_series):
    if dataframe_series.dtype=='object':
        dataframe_series = LabelEncoder().fit_transform(dataframe_series)
    return dataframe_series
    
df = df.apply(lambda x: object_to_int(x))
df.head()
Out[21]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 0 0 1 0 1 0 1 0 0 2 0 0 0 0 0 1 2 29.85 29.85 0
1 1 0 0 0 34 1 0 0 2 0 2 0 0 0 1 0 3 56.95 1889.50 0
2 1 0 0 0 2 1 0 0 2 2 0 0 0 0 0 1 3 53.85 108.15 1
3 1 0 0 0 45 0 1 0 2 0 2 2 0 0 1 0 0 42.30 1840.75 0
4 0 0 0 0 2 1 0 1 0 0 0 0 0 0 0 1 2 70.70 151.65 1
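LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels, which is why "Electronic check" appears as 2 and "Mailed check" as 3 in the table above. The following minimal sketch shows how to read the mapping back from a fitted encoder; the variable names are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The four PaymentMethod categories, in the order they first appear in the data.
payment_methods = [
    "Electronic check",
    "Mailed check",
    "Bank transfer (automatic)",
    "Credit card (automatic)",
]

encoder = LabelEncoder()
codes = encoder.fit_transform(payment_methods)

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Keeping such a mapping (or the fitted encoder itself) for every column makes it possible to encode new inputs later without retyping the codes by hand.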

Exploratory Data Analysis

univariate analysis

In [22]:
#plotting same pie chart using matplotlib
values1=df['gender'].value_counts()
g_labels = ['Male', 'Female']

plt.figure(figsize=(8.5, 8.5))
colors = sns.color_palette('pastel')[0:5]   #'bright', etc. are some other color palettes

plt.pie(values1, labels = g_labels, explode = [0.07,0],colors=colors, shadow = True)
plt.legend(title="Gender")
plt.show() 
In [23]:
#using plotly to plot an interactive pie chart
values1=df['gender'].value_counts()
g_labels = ['Male', 'Female']
colors = ['lightblue', 'pink']

#build the figure from a single Pie trace (px.pie followed by add_trace would draw the pie twice)
fig = go.Figure(go.Pie(labels=g_labels,values=values1))

fig.update_traces(hole=0.5, hoverinfo ="label+percent", textfont_size=20, marker=dict(colors = colors))

fig.update_layout(title_text="<b>Gender Distribution</b>", annotations=[dict(text='Gender', x=0.5, y=0.5, font_size=20, showarrow=False)])
fig.show()

Customers are 49.5 % female and 50.5 % male.

In [24]:
values1=df['Churn'].value_counts()
c_labels = ['No','Yes']
colors = ['#2BE592', '#E96D73']

#build the figure from a single Pie trace (px.pie followed by add_trace would draw the pie twice)
fig = go.Figure(go.Pie(labels=c_labels,values=values1))

fig.update_traces(hole=0.5, hoverinfo ="label+percent", textfont_size=20, marker=dict(colors=colors))

fig.update_layout(title_text="Churn Distribution",
                  annotations=[dict(text='Churn', x=0.5, y=0.5, font_size=20, showarrow=False)])

fig.show()

26.6% of customers switched to another firm, while around 73% of our customers stayed with our services rather than moving to those offered by other firms.

In [25]:
plt.figure(figsize=(6, 6))

labels =["Churn: Yes","Churn:No"]
values = [1869,5163]

labels_gender = ["F","M","F","M"]
sizes_gender = [939,930 , 2544,2619]

colors = ['#ff6666', '#66b3ff']
colors_gender = ['#c2c2f0','#ffb3e6', '#c2c2f0','#ffb3e6']
explode = (0.3,0.3) 
explode_gender = (0.4,0.4,0.3,0.3)
textprops = {"fontsize":15}

#Plot
plt.pie(values, labels=labels,autopct='%1.1f%%',pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90, explode=explode,radius=10, textprops =textprops )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),4.3, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.title('Churn Distribution w.r.t. Gender', fontsize=15, y=1.1)

# show plot 
 
plt.axis('square') #makes the plot square, i.e. both axes get the same length and scaling
plt.tight_layout()
plt.show()

Here we see both genders behaved in similar fashion when it comes to migrating to another service provider.

In [26]:
graph = sns.kdeplot(df['MonthlyCharges'][(df['Churn'] == 1) ],color ='#ff528f',fill = True)  #fill replaces the deprecated shade argument
graph.legend(["Churn: Yes"], loc = 'upper right')
graph.set_ylabel('Density')
graph.set_xlabel('Monthly Charges')
graph.set_title('Churn Distribution w.r.t. Monthly Charges')
Out[26]:
Text(0.5, 1.0, 'Churn Distribution w.r.t. Monthly Charges')

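The density of churned customers is concentrated at the higher monthly charges, which suggests that the higher the monthly charges, the higher the chance of a customer churning.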

In [27]:
df_corr = df.corr()
fig = go.Figure()
fig.add_trace(go.Heatmap(x = df_corr.columns,y = df_corr.index,z = np.array(df_corr),text=df_corr.values,texttemplate='%{text:.2f}'))
fig.update_layout(autosize=False,width=1000,height=1000,margin=dict(l=50,r=50,b=100,t=100,pad=4))
fig.show()

This heatmap is useful because it helps identify highly correlated variables, which in turn streamlines the feature selection process.
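When the main interest is the relationship with the target, a single column of the correlation matrix can be easier to read than the full heatmap. The sketch below uses a small made-up frame (the values are hypothetical, not taken from the dataset) to show how to sort the correlations with Churn.

```python
import pandas as pd

# Hypothetical toy data: short tenure and high charges go with Churn = 1.
toy = pd.DataFrame({
    "tenure": [1, 2, 5, 30, 45, 60, 70, 3],
    "MonthlyCharges": [70.0, 85.5, 90.1, 40.0, 55.3, 25.0, 20.5, 99.9],
    "Churn": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Take the Churn column of the correlation matrix, drop the trivial
# self-correlation, and sort from most negative to most positive.
churn_corr = toy.corr()["Churn"].drop("Churn").sort_values()
print(churn_corr)
```

On the notebook's encoded DataFrame the same expression, df.corr()["Churn"], would rank all 19 predictors; keep in mind that correlations computed on label-encoded categories with more than two levels depend on the arbitrary code order.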

In [28]:
#count plot of every remaining predictor, split by Churn
for i, predictor in enumerate(df2.drop(columns=['Churn','TotalCharges','MonthlyCharges'])):
  plt.figure(i, figsize=(20,8))  #one figure per predictor; a separate plt.figure(figsize=...) call leaves an empty extra figure
  sns.countplot(data=df2, x= predictor, hue = 'Churn')

CONCLUSION

-Customers paying by electronic check are the highest churners

-Customers with no online security, no tech support and no device protection are high churners

-Non-senior citizens make up most of the churners

-Customers with paperless billing are high churners

For example, customers without dependents are more likely to churn: compare the "Yes" and "No" dependents categories and see which one has the taller orange (Churn = Yes) bar; that category has more churners.

In [29]:
df_yes = df2[df2['Churn']=='Yes']
fig = px.histogram(df2, x="Churn", color="Contract", barmode="group",title="<b>Contract distribution</b>")
fig.update_layout(width =800, bargap=0.1)
fig.show()

Customers with month-to-month contracts are more likely to churn than customers with one-year and two-year contracts.

Monthly customers are more likely to churn because they are not bound by contract terms and are free to go.

About 75% of customers with a Month-to-Month Contract opted to move out, as compared to 13% of customers with a One Year Contract and 3% with a Two Year Contract
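Percentages like the ones above can be checked directly with a row-normalised cross-tabulation. The following is a minimal sketch on made-up rows (the toy frame is hypothetical); on the notebook's data the same call would be applied to df2["Contract"] and df2["Churn"].

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

# normalize="index" divides each row by its total, so every row sums to 1
# and the "Yes" column is the churn rate within that contract type.
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")
print(rates)
```

Normalising by row answers "what share of each contract type churned", whereas normalize="columns" would answer "what share of churners had each contract type"; the two are easy to confuse when reading grouped bar charts.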

In [30]:
df_yes = df2[df2['Churn']=='Yes']
color_map = {"Yes": "#0000FF", "No": "#FF00FF"}
fig =  px.bar(df2, x="Contract",  color="Churn", barmode="group", facet_col="gender", color_discrete_map=color_map,title="<b>Contract distribution w.r.t. Gender</b>",
             category_orders={"Contract": ["Month-to-month","One year","Two year"],"Churn": ["Yes", "No"],"gender": ["Male", "Female"]})
fig.show()

The distribution is almost the same for male and female customers

In [31]:
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df_yes, x="Churn", color="Dependents", barmode="group", title="<b>Dependents distribution</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

From the graph we can interpret that customers with no dependents are more likely to churn

In [32]:
color_map = {"Yes": '#00CC96', "No": '#B6E880'}
fig = px.histogram(df2, x="Churn", color="SeniorCitizen", title="<b>Churn distribution w.r.t. Senior Citizen</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

We can interpret that most of the customers who churned are not senior citizens

In [33]:
color_map = {"Yes": '#00CC96', "No": '#B6E880'}
fig = px.histogram(df2, x="Churn", color="PhoneService", title="<b>Churn distribution w.r.t. Phone Service</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

Most of the customers who churned have phone service

In [34]:
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df_yes, x="Churn", color="Partner", barmode="group", title="<b>Churn distribution w.r.t. Partners</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

Here we see that customers with no partner are more likely to churn

In [35]:
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df_yes, x="Churn", color="PaperlessBilling",barmode="group",title="<b>Churn distribution w.r.t. Paperless Billing</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

Customers who use paperless billing are more likely to churn

In [36]:
fig = px.histogram(df_yes, x="Churn", color="PaymentMethod",barmode="group", title="<b>Customer Payment Method distribution w.r.t. Churn</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

Most of the customers who moved out used Electronic Check as their Payment Method. Customers who paid by automatic credit card transfer, automatic bank transfer or mailed check were less likely to move out.

In [37]:
fig = px.box(df2, x='Churn', y = 'tenure')

fig.update_yaxes(title_text='Tenure (Months)')
fig.update_xaxes(title_text='Churn')

fig.update_layout(autosize=True, width=750, height=600,
    title='<b>Churn distribution w.r.t. Tenure</b>')

fig.show()

New customers are more likely to churn

(i.e., we can infer that customers who have used the service for only a short time, roughly 10 months or less, are more likely to churn)

Machine Learning Model Evaluations and Prediction

In [38]:
from sklearn.model_selection import train_test_split

X = df.drop(columns = ['Churn'])
y = df['Churn'].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.30, random_state = 40, stratify=y)
#30% will go towards testing
In [39]:
from sklearn.preprocessing import StandardScaler

num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

scaler= StandardScaler()

# The main idea is to standardize each numeric feature individually (mean μ = 0, standard deviation σ = 1) before applying any machine learning model.
# Leaving the variances unequal is equivalent to putting more weight on the features with larger scales, which distorts distance-based models such as KNN.

X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

#fit_transform:
#computes the mean and standard deviation of each feature on the training set and scales the training set with them;
#transform then reuses those training statistics to scale the test set.
In [40]:
df.shape
Out[40]:
(7032, 20)
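To make the scaling step concrete, the following minimal sketch standardizes a made-up tenure column. The arrays and the name scaler_demo are illustrative assumptions, chosen so that the notebook's own scaler object is not overwritten.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical tenure values (months) for a training and a test split.
train = np.array([[1.0], [10.0], [30.0], [72.0]])
test = np.array([[5.0]])

scaler_demo = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data.
train_scaled = scaler_demo.fit_transform(train)

# transform reuses the training statistics, so no test information leaks in.
test_scaled = scaler_demo.transform(test)

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1, while the test value is expressed relative to the training mean and standard deviation rather than its own.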

Using KNN

In [41]:
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error,f1_score, recall_score
In [42]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors = 8) 
knn_model.fit(X_train,y_train)
predicted_y = knn_model.predict(X_test)

accuracy_KNN = knn_model.score(X_test,y_test)
recallKNN = recall_score(y_test, predicted_y)
F1KNN = f1_score(y_test,predicted_y)
mseKNN = mean_squared_error(y_test,predicted_y)
maeKNN = mean_absolute_error (y_test,predicted_y)
print("KNN accuracy:",round(accuracy_KNN,2))
KNN accuracy: 0.78
In [43]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix (y_test, predicted_y)
cm
Out[43]:
array([[1400,  149],
       [ 318,  243]], dtype=int64)
In [44]:
#the sum of all four cells is the number of test rows (2110)
#the main diagonal (top-left to bottom-right) holds the correctly predicted rows: 1400 + 243 = 1643
#the other diagonal holds the misclassified rows: 149 + 318 = 467
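The usual classification metrics can be derived by hand from the four cells of the KNN confusion matrix shown above. This is a sketch for illustration; the variable names are not part of the notebook.

```python
import numpy as np

# KNN confusion matrix from Out[43]: rows are actual classes, columns predicted.
cm_knn = np.array([[1400, 149],
                   [318, 243]])

# For labels [0, 1], ravel() yields the cells in this order.
tn, fp, fn, tp = cm_knn.ravel()

accuracy = (tp + tn) / cm_knn.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted churners who really churned
recall = tp / (tp + fn)               # share of actual churners that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# prints 0.7787 0.6199 0.4332 0.51
```

The accuracy matches the 0.78 printed for KNN, while the recall of about 0.43 shows that the model misses more than half of the customers who actually churn, which accuracy alone hides.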

Using Decision Tree

In [45]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(criterion="entropy", max_depth =6)
dt_model.fit(X_train,y_train)
predicted_y = dt_model.predict(X_test)

accuracy_DT = dt_model.score(X_test,y_test)
recallDT = recall_score(y_test, predicted_y)
F1DT = f1_score(y_test,predicted_y)
DTmse = mean_squared_error(y_test,predicted_y)
DTmae = mean_absolute_error (y_test,predicted_y)

print("Decision Tree accuracy is :",round(accuracy_DT,2))
Decision Tree accuracy is : 0.79
In [46]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix (y_test, predicted_y)
cm
Out[46]:
array([[1314,  235],
       [ 214,  347]], dtype=int64)
In [47]:
from sklearn.tree import export_graphviz
# generating decision tree image
# feature_names are the 19 predictor columns; class_names are the two Churn classes (0 = No, 1 = Yes)
export_graphviz(dt_model,out_file='iris_tree.dot',feature_names=df.columns[:19], class_names=['No','Yes'], rounded=True, filled=True)

!dot -Tpng iris_tree.dot -o iris_tree.png
'dot' is not recognized as an internal or external command,
operable program or batch file.

Using SVM

In [48]:
from sklearn import svm

SVM_cls = svm.SVC(kernel = "linear")
SVM_cls.fit(X_train, y_train)
svmPrediction = SVM_cls.predict(X_test)

accuracysvm = accuracy_score(y_test, svmPrediction)
recallSVM = recall_score(y_test, svmPrediction)
mseSVM = mean_squared_error(y_test,svmPrediction)
maeSVM = mean_absolute_error (y_test,svmPrediction)
F1SVM =  f1_score(y_test,svmPrediction,average='binary')

print("Support Vector Classifier accuracy is :",round(accuracysvm,2))
Support Vector Classifier accuracy is : 0.8
In [49]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix (y_test, svmPrediction)
cm
Out[49]:
array([[1374,  175],
       [ 253,  308]], dtype=int64)

Report

From the table below it is clear that the SVM model achieves the highest accuracy of the three models (79.7%, against 78.7% for the decision tree and 77.9% for KNN), with an F1-score of 0.59. Its mean squared error and mean absolute error are correspondingly the lowest. The decision tree, however, has slightly higher recall (0.62) and F1-score (0.61) than the SVM, so it finds more of the customers who actually churn.

In [50]:
#each row holds: algorithm, accuracy, recall, F1, MSE, MAE (the header order must match the value order)
data_report = np.array([['Algorithm','Accuracy Score','Recall','F1-Score','Mean Square Error','Mean Absolute Error'],['KNN',accuracy_KNN ,recallKNN, F1KNN,mseKNN,maeKNN],['Decision Tree',accuracy_DT ,recallDT, F1DT,DTmse,DTmae],['SVM',accuracysvm ,recallSVM,F1SVM,mseSVM,maeSVM]])

df_report = pd.DataFrame(data = data_report[1:,1:], index = data_report[1:,0],columns = data_report[0,1:])
df_report
Out[50]:
               Accuracy Score      Recall               F1-Score            Mean Square Error   Mean Absolute Error
KNN            0.7786729857819905  0.43315508021390375  0.5099685204617     0.2213270142180095  0.2213270142180095
Decision Tree  0.7872037914691943  0.6185383244206774   0.6071741032370953  0.2127962085308057  0.2127962085308057
SVM            0.7971563981042654  0.5490196078431373   0.5900383141762453  0.2028436018957346  0.2028436018957346

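Mean squared error: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the prediction; the lower the better

Mean absolute error: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$, the average of the absolute errors $\Delta x = |x_i - x|$ (measured value minus true value)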

The accuracy score measures the overall performance of a model, and the confusion matrix shows which kinds of errors the model makes.


The mean squared error (MSE) measures how far a set of points lies from the regression line: it takes the distances from the points to the line and squares them. These distances are the errors. Squaring removes negative signs and gives more weight to larger differences.

A smaller MSE means the model is closer to the best-fit line. How small the MSE can get also depends on the data you are working with, so sometimes a small MSE is not achievable.


Absolute Error is the amount of error in your measurements. It is the difference between the measured value and “true” value. For example, if a scale states 90 pounds but you know your true weight is 89 pounds, then the scale has an absolute error of 90 lbs – 89 lbs = 1 lbs.
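For a classifier with 0/1 labels, every error is either 0 or 1, so squaring changes nothing and MSE and MAE both reduce to the misclassification rate, 1 - accuracy. That is why the two error columns in the report table are identical. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Hypothetical true and predicted labels; positions 1 and 3 are misclassified.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

acc = accuracy_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(acc, mse, mae)  # prints 0.75 0.25 0.25
```

Because of this identity, MSE and MAE add no information beyond accuracy in a binary classification report; precision, recall and F1 are the metrics that describe how the errors are distributed.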



Accuracy is a metric for classification models that measures the fraction of predictions that are correct

A precise model is very "pure": it may not find all the positives, but the cases it does classify as positive are very likely to be correct

A model with high recall finds most of the positive cases in the data, even though it may also wrongly label some negative cases as positive

The F1 score is defined as the harmonic mean of precision and recall.

Testing:

In [51]:
def label_encode_input(item,data):
  if item == "gender":
    data = 1 if data == "Male" else 0
  elif item == "SeniorCitizen":
    data = 1 if data == "Yes" else 0
  elif item == "Partner":
    data = 1 if data == "Yes" else 0
  elif item == "Dependents":
    data = 1 if data == "Yes" else 0
  elif item == "PhoneService":
    data = 0 if data == "No" else 1
  elif item == "MultipleLines":
    data = 0 if data == "No" else ( 2 if data == "Yes" else 1)
  elif item == "InternetService":
    data = 2 if data == "No" else ( 1 if data == "Fiber optic" else 0)  # LabelEncoder order: DSL=0, Fiber optic=1, No=2
  elif item == "OnlineSecurity":
    data = 0 if data == "No" else ( 2 if data=="Yes" else 1)
  elif item == "OnlineBackup":
    data = 0 if data == "No" else ( 2 if data=="Yes" else 1)
  elif item == "DeviceProtection":
    data = 0 if data == "No" else ( 2 if data=="Yes" else 1)
  elif item == "TechSupport":
    data = 0 if data == "No" else ( 2 if data=="Yes" else 1)
  elif item == "StreamingTV":
    data = 0 if data == "No" else ( 2 if data=="Yes" else 1)
  elif item == "StreamingMovies":
    data = 0 if data == "No" else ( 2 if data=="Yes" else 1)
  elif item == "Contract":
    data = 0 if data == "Month-to-month" else ( 1 if data=="One year" else 2)
  elif item == "PaperlessBilling":
    data = 1 if data == "Yes" else 0
  elif item == "PaymentMethod":
    data = 2 if data == "Electronic check" else ( 3 if data=="Mailed check" else (0 if data == "Bank transfer (automatic)" else 1))
  else:
    data = float(data)
  return float(data)
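The hand-written mapping above has to mirror the alphabetical order that LabelEncoder used during training, which is easy to get wrong. One alternative, sketched here under the assumption that one encoder is fitted and stored per column (the notebook itself does not keep its encoders), is to reuse the fitted encoders for new inputs. The toy frame and the names encoders and encode_value are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the categorical training columns.
toy = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", "No", "DSL"],
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
})

# Fit and keep one encoder per column so the same codes can be reused later.
encoders = {col: LabelEncoder().fit(toy[col]) for col in toy.columns}

def encode_value(column, value):
    """Encode one raw value; raises ValueError for a label not seen in training."""
    return int(encoders[column].transform([value])[0])

print(encode_value("InternetService", "Fiber optic"))
print(encode_value("Contract", "Two year"))
```

Unlike a chain of if/else branches, this approach rejects a misspelled category with an error instead of silently mapping it to a default code.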

Accept Input from User to predict and preprocess input:

In [54]:
columns = ['gender', 'SeniorCitizen', 'Partner', 'Dependents','tenure', 'PhoneService', 'MultipleLines', 'InternetService','OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling','PaymentMethod', 'MonthlyCharges', 'TotalCharges']
X=[]
example_input = ["Female","No","Yes","No",1,"No","No phone service","DSL","No","Yes","No","No","No","No","Month-to-month","Yes","Electronic check",29.85,29.85]

print("Please Enter the data values:")
for item in columns:
  print(item," : ")
  if item == "gender":
    print("( 'Female' , 'Male' )")
  elif item == "SeniorCitizen":
    print("( 'No' , 'Yes' )")
  elif item == "Partner":
    print("('Yes', 'No')")
  elif item == "Dependents":
    print("('No', 'Yes')")
  elif item == "PhoneService":
    print("('No', 'Yes')")
  elif item == "MultipleLines":
    print("('No phone service', 'No', 'Yes')")
  elif item == "InternetService":
    print("('DSL', 'Fiber optic', 'No')")
  elif item == "OnlineSecurity":
    print("('No', 'Yes', 'No internet service')")
  elif item == "OnlineBackup":
    print("('Yes', 'No', 'No internet service')")
  elif item == "DeviceProtection":
    print("('No', 'Yes', 'No internet service')")
  elif item == "TechSupport":
    print("('No', 'Yes', 'No internet service')")
  elif item == "StreamingTV":
    print("('No', 'Yes', 'No internet service')")
  elif item == "StreamingMovies":
    print("('No', 'Yes', 'No internet service')")
  elif item == "Contract":
    print("('Month-to-month', 'One year', 'Two year')")
  elif item == "PaperlessBilling":
    print("('Yes', 'No')")
  elif item == "PaymentMethod":
    print("('Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)')")

  data = input(" --> ")

  data = label_encode_input(item,data)
  X.append(data)

print("input Feature -> ",X)
Please Enter the data values:
gender  : 
( 'Female' , 'Male' )
 --> Male
SeniorCitizen  : 
( 'No' , 'Yes' )
 --> No
Partner  : 
('Yes', 'No')
 --> No
Dependents  : 
('No', 'Yes')
 --> No
tenure  : 
 --> 13
PhoneService  : 
('No', 'Yes')
 --> No
MultipleLines  : 
('No phone service', 'No', 'Yes')
 --> No
InternetService  : 
('DSL', 'Fiber optic', 'No')
 --> DSL
OnlineSecurity  : 
('No', 'Yes', 'No internet service')
 --> No
OnlineBackup  : 
('Yes', 'No', 'No internet service')
 --> No
DeviceProtection  : 
('No', 'Yes', 'No internet service')
 --> No
TechSupport  : 
('No', 'Yes', 'No internet service')
 --> No
StreamingTV  : 
('No', 'Yes', 'No internet service')
 --> No
StreamingMovies  : 
('No', 'Yes', 'No internet service')
 --> No
Contract  : 
('Month-to-month', 'One year', 'Two year')
 --> Month-to-month
PaperlessBilling  : 
('Yes', 'No')
 --> Yes
PaymentMethod  : 
('Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)')
 --> Electronic check
MonthlyCharges  : 
 --> 12.3
TotalCharges  : 
 --> 161.2
input Feature ->  [1.0, 0.0, 0.0, 0.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 12.3, 161.2]

Prediction

In [56]:
X[4],X[-2],X[-1] = scaler.transform([[X[4],X[-2],X[-1]]])[0]
svmPrediction = SVM_cls.predict([X])
if svmPrediction[0] == 0:
  print("Customer NOT churned. -> NO")
else:
  print("Customer HAS churned. -> YES")
Customer NOT churned. -> NO